03. More on the Policy

In the previous video, you learned how the agent could use a simple neural network architecture to approximate a stochastic policy. The agent passes the current environment state as input to the network, which returns action probabilities. Then, the agent samples from those probabilities to select an action.

Neural network that encodes action probabilities ([Source](https://blog.openai.com/evolution-strategies/))

The same neural network architecture can be used to approximate a deterministic policy. Instead of sampling from the action probabilities, the agent need only select the action with the highest probability (the greedy action).
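For concreteness, here is a minimal sketch of such a policy network in PyTorch, assuming CartPole's 4-dimensional state and 2 discrete actions; the class name `Policy` and the hidden layer size are arbitrary illustrative choices, not part of the lesson.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    """Maps an environment state to a probability for each discrete action."""
    def __init__(self, state_size=4, action_size=2, hidden_size=16):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        return F.softmax(self.fc2(x), dim=-1)   # action probabilities sum to 1

policy = Policy()
state = torch.rand(1, 4)                        # placeholder for a CartPole observation

probs = policy(state)
stochastic_action = torch.multinomial(probs, num_samples=1)  # sample from the probabilities
greedy_action = probs.argmax(dim=-1)                         # deterministic (greedy) choice
```

Sampling from `probs` implements the stochastic policy; taking the argmax implements the deterministic one.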

## Quiz

In the video above, you learned that the neural network that approximates the policy takes the environment state as input. The output layer returns the probability that the agent should select each possible action. Which of the following is a valid activation function for the output layer?

SOLUTION: softmax

## What about continuous action spaces?

The CartPole environment has a discrete action space. So how do we use a neural network to approximate a policy if the environment has a continuous action space?

As you learned above, in the case of discrete action spaces, the neural network has one node for each possible action.

For continuous action spaces, the neural network has one node for each action entry (or index). For example, consider the action space of the bipedal walker environment, shown in the figure below.

Action space of `BipedalWalker-v2` ([Source](https://github.com/openai/gym/wiki/BipedalWalker-v2))

In this case, each action is a vector of four numbers, so the output layer of the policy network will have four nodes.

Since every entry in the action must be a number between -1 and 1, we will add a tanh activation function to the output layer.
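As a rough sketch (again in PyTorch, and assuming BipedalWalker-v2's 24-dimensional observation and 4-dimensional action), the only changes from the discrete case above are the output size and the tanh activation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousPolicy(nn.Module):
    """Maps a state to one value per action entry, each squashed to [-1, 1]."""
    def __init__(self, state_size=24, action_size=4, hidden_size=32):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        return torch.tanh(self.fc2(x))          # every entry lies in [-1, 1]

policy = ContinuousPolicy()
state = torch.rand(1, 24)                       # placeholder for a BipedalWalker observation
action = policy(state)                          # shape (1, 4), valid for the environment
```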

As another example, consider the continuous mountain car benchmark. The action space is shown in the figure below. Note that for this environment, the action must be a value between -1 and 1.

Action space of `MountainCarContinuous-v0` ([Source](https://github.com/openai/gym/wiki/MountainCarContinuous-v0))

## Quiz

Consider the MountainCarContinuous-v0 environment. Which of the following describes a valid output layer for the policy? (Select the option that yields valid actions that can be passed directly to the environment without any additional preprocessing.)

SOLUTION: Layer size: 1, Activation function: tanh
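A minimal sketch of that output layer in PyTorch, assuming an arbitrary 16-unit hidden layer feeding into it (the hidden size is an illustrative choice, not part of the quiz):

```python
import torch
import torch.nn as nn

# Output layer for MountainCarContinuous-v0: a single node with tanh,
# so the network emits one value in [-1, 1] per state.
output_layer = nn.Sequential(nn.Linear(16, 1), nn.Tanh())

hidden_features = torch.rand(1, 16)             # placeholder for the previous layer's output
action = output_layer(hidden_features)          # shape (1, 1), already a valid action
```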